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Abstract 

We study the general approach to accelerating the convergence of the most widely used 
solution method of Markov decision processes with the total expected discounted reward. 
Inspired by the monotone behavior of the contraction mappings in the feasible set of the 
linear programming problem equivalent to the MDP, we establish a class of operators that 
can be used in combination with a contraction mapping operator in the standard value 
iteration algorithm and its variants. We then propose two such operators, which can be easily 
implemented as part of the value iteration algorithm and its variants. Numerical studies 
show that the computational savings can be significant especially when the discount factor 
approaches 1 and the transition probability matrix becomes dense, in which the standard 
value iteration algorithm and its variants suffer from slow convergence. 
Keywords: Markov decision processes, value iteration, accelerated convergence, linear pro- 
gramming. 



1 Introduction 

Consider an infinite horizon Markov decision process (MDP) with a finite set of % states 
denoted by S, a finite set of actions A(i) for each state % G S, an immediate reward r(i, a) 
for each i G S and a G A = Ui e sA(i), and a transition probability Pij(a) for each i, j G S and 
a G -A(i). The objective is to determine the maximum expected total discounted reward 
over an infinite horizon starting in state i, where A is the discount factor (0 < A < 1). It is 
well known [7\ that v satisfies the optimality equation 



and the actions attaining the maximum in Equation give rise to an optimal stationary 
policy. Let U denote the set of bounded real valued functions on S with the norm \\v\ \ = 
maxj g s \vi\, and let the function d : S —>■ A(s) specifies the action choice for state i G S (i.e., 
d(i) G A(i) for each i G S). Furthermore, let denote the |,S|-vector, with i-th component 
Td{i) = r(i, d(i)), and Pj the \S\ x l^l matrix with (i, j)-th entry given by Pd(j\i) = Pij(d(i))- 
Then the optimality equation given in Equation ([1]) can be written, with the definition of 
the operator T on U, in the following vector notation: 



where II is the set of policies. 

There are several standard methods for finding optimal or approximately optimal policies 
for the MDP. Approaches widely employed to solve MDP problems include value iteration 
and policy iteration [7j. Although simple to implement, these approaches are nevertheless 




(1) 



v = Tv = max {r d + XPjv} , 



(2) 
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limited in the size of problems that can be solved due to the excessive computation required to 
find close-to-optimal solutions. Techniques to improve the convergence of the value iteration 
algorithm have been studied in p2, 0], just to name a few. The standard value iteration 
algorithm is summarized as follows [7J 
Value Iteration (VI) Algorithm 

Step Select v° G IR 7 , set n — 0, and specify e > 0. 
Step 1 Compute v™ +l = Tv™ for all i G S. 

Step 2 If ||i; n+1 -i; n || < e(l— A)/2A, go to Step 3. Otherwise, increase 
n by 1 and return to Step 1 . 

Step 3 Return with the actions attaining the maximum in Step 1. 

An alternative approach is based on linear programming. Although linear programming- 
based approaches are generally dismissed as inefficient [7J, they have recently inspired the 
research community to find new approaches using the well-studied theory of linear program- 
ming. Widely employed linear programming formulations [21 El E] are due to Derman [3] as 
given below: 

min | Vj - \ y^ j p ij (a)v j > r(i,a),\/i e S,Wa e A(i),v G M |5| | . (3) 

Notice that the LP is a minimization problem as opposed to the given MDP being a max- 
imization problem. This is due to the implementation of the contraction mapping as con- 
straints of the LP. 

One of the recent methods rooted in the above linear programming formulation is an 
approximation approach using basis functions [2]. They use a linear combination of pre- 
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selected basis functions to fit the value function of MDP to obtain an approximate value 
function, which is then optimized in the feasible set of the linear program - named the 
approximate linear program. While the quality of the approximation heavily depends on 
the pre-selected basis functions, they fail to provide reliable guidelines on how to chose basis 
functions. The feasible set V of the equivalent linear program can be written in vector 
notation as follows: 

V = {v e R |51 | v > Tv}. (4) 

We propose a new class of operators that can be integrated into the standard and variants 
of the value iteration algorithm for Markov decision processes so as to speed up the conver- 
gence of the iterative search. The speed of convergence will be measured in both the number 
of iterations and the CPU time. The characterization of the operator class is motivated by 
the monotone behavior of contraction mappings in set V and the ideas of the basis function 
approximation. The major contribution of this paper is the identification of a new class of 
operators which accelerate the value iteration algorithm and its variants. The observation of 
monotone behavior of the contraction mapping in set V in Equation (j3J) would also inspire 
the research community to explore other types of operators. 

The rest of the paper is organized as follows. Section [2] introduces conditions which 
need to be satisfied by accelerating operators, that are to be combined with value iteration 
operators. Section [3] presents the performance improvements due to the acceleration as well 
as a discussion on computational complexity of the acceleration operators. In section [4] we 
discuss the application of the acceleration operators to the expected total reward, semi- 
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Markov and continuous-time Markov decision processes. Finally, section [5] concludes the 



paper. 



2 Accelerated Value Iterations Algorithms 

In this section we will introduce a class of operators that will speed up the convergence of the 
standard value iteration and its variants. Operators in this class can be used in combination 
with the contraction mapping of the value iteration algorithm and its variants. We will also 
present two such operators later in this section. 

The most crucial observation, that leads to characterization of the acceleration operators, 
is that the operator T is a component-wise monotone mapping in V and the set V is invariant 
under T as formally stated in Lemma 12. 1[ 

Lemma 2.1. V is invariant under T. That is, TV C V. 

It is also known [7] that the optimal solution v* of the MDP or the fixed point of the 
operator T given in ([2]) is such that 



v* < v, Vf e V. 



Lemma 12.11 suggests the following conditions for acceleration operator Z on the set V . 
Acceleration Conditions 

(A) ZV C V, 

(B) Zv < v, Vv E V. 



One can see that the identity operator and the operator T satisfy both of the conditions. 
For a given operator satisfying two conditions (A) and (B), an accelerated value iteration 
algorithm can be defined. Below is a summary of the Generalized Accelerated Value Iteration 
Algorithm (GAVI) proposed in the present paper. 

General Accelerated Value Iteration Algorithm (GAVI) 
Step Select w° G V , set n = 0, and specify e > 0. 
Step 1 Compute = ZTw™ for all i G 3. 

Step 2 If \\Tw n -w n \\ < e(l-A)/2A, go to Step 3. Otherwise increase 
n by 1 and return to Step 1 . 

Step 3 Return with the actions attaining the maximum in Step 1. 

Notice that different acceleration operators may be used in different iterations of the 
value iteration algorithm, in which case Z n is an acceleration operator used in Step 1 of n-th 
iteration, instead of Z. In this paper we restrict our attention to the case where Z n = Z for 
all n. 

Remark 2.1. GAVI must be initialized with w° G V. It is shown in [8] that there always 
exists a feasible solution «(g V) of the equivalent LP in form u(i) = al for all i, where 1 
is a vector of ones. Indeed, taking a = (1/(1 — A)) max ie s iag ^(j) r(i, a), we will have u be a 
feasible solution of the above LP. 

Remark 2.2. It is easy to show that if v* is the fixed point of operator T, v n+1 = Tv n , and 

w° = v °. Then 

(i) v* <w n < v r \ 

(ii) v* = limn-oo w n = lim^^ v n , 
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(iii) \\v* — w n \\ < \\v* — v n \\. 

(iv) The sequence {w n } converges globally on V(for all w° G V) with order 1 at a rate less 
than or equal to A; its global asymptotic average rate of convergence is less than or 
equal to A 1 . 

As mentioned earlier, T itself is such an operator, resulting in an algorithm that performs 
Step 1 of VI twice per iteration. Trivially, such modification accelerates the convergence of 
the algorithm but each iteration is twice as expensive as VI. Therefore, it is important to 
identify an acceleration operator satisfying two conditions (A) and (B) that requires little 
additional computation, so that the reduction in the number of iterations before convergence 
is more dramatic than the increased computation per iteration. If the condition (B) holds 
strictly, the improvement per iteration is guaranteed to be more than that of the standard 
value iteration (VI). 

Now we propose an acceleration operator that requires little additional computation per 
iteration but reduces the number of iterations significantly. Understanding that the value 
iteration algorithm is simply finding a component-wise minimum point in the set V, any 
operator that maps a given point (vector) v in V to another vector u in V such that u < v 
with the strict inequality with at least one component will accelerate the convergence. 

Projective Operator 

For v G V, Z : v — > a*v, where a* is the optimal solution of the following optimization 
^^An extensive discussion of convergence orders and rates can be found in [7]. 
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problem: 

I a Vi | T(av) < avj . (5) 



mm 



We will call GAVI with Projective Operator as Projective Accelerated Value Iteration or 
PAVI for short in the paper. It is interesting to view Projective Operator as an approximation 
approach using a basis as is proposed in [2j since the optimization problem given in (JHJ) is 
nothing but the approximate LP with a single basis function v found by the standard operator 
T in Step 1 of VI or GAVI. The choice of basis functions is not an issue in this approach 
since V is invariant under T and T is a contraction mapping. That is, T finds a vector 
v G V so that optimizing the approximate linear program with a single decision variable, 
denoted by a in the definition of Projective Operator, results in a new vector closer to the 
fixed point. The role of Projective Operator is graphically illustrated in Figured], where Z 
projects the given point v G V to the boundary of V. 



Figure [T] goes here. 



Theorem 2.1. Projective Operator Z satisfies the conditions (A) and (B) ifr(i,a) > for 
all i G S, for all a G A(i). 
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Remark 2.3. The non-negativity condition is not really restrictive. The rewards can be 
adjusted to be non-negative for all states and actions 

r(i,a)' = r(i,a) + max {\r(j, a)\\, Vz G S, Va G A(i). 

jeS,a£A(j) 

The new MDP shares the optimal policy with the original so that it will not matter which 
MDP to solve. For the expected total discounted MDP, the reward adjustment can also be 
considered as the coordinate translation as follows: 

v' i = v i + max {\r(j, a)\/ (1 — A)}, Vi G S, 

jeS,a£A(j) 

Let dV denote the set of the boundary points of V and int(V) the set of the interior 
points of V, i.e., int(V) — V \ dV. 

Lemma 2.2. Suppose that is fully dense (no zero entry) for all d 6 II, then Tv < v for 
allveV with v ^ v*. 

Remark 2.4. Notice that if the transition probability matrix Pd is fully dense for all d G II, 
the strict inequality in (i) and (iii) of the Remark 12.21 is guaranteed, i.e. the sequence of 
iterates PAVI converges faster than that of VI. This is because it is always true that for all 
v & V with v 7^ v*, Tv G int(V) and Projective Operator maps Tv G int(V) to a vector 
w G dV. It is obvious that w < v. It worth to mention, however, that strict inequalities in 
(i) and (iii) do not guaranty that the stopping criteria of value iteration algorithm will be 
satisfied strictly faster. 

Lemma 2.3. The interior ofV is invariant under T. That is, 

T(int(V)) C int{V). 
8 



Lemma [2.31 suggests a variation of Projective Operator for which the strict inequality in 
(i) and (iii) of the Remark 12.21 is guaranteed, which allows the sequence of iterates PAVI to 
converge faster than that of VI in sense of (i) and (iii). Given a factor /3(0 < (3 < 1), one 
can define a variation of Projective Operator as Z^u = (1 — f3)Zu + (3u, which always finds 
an interior point between Tw 11 and ZTw n = w n+1 so that in the next iteration Tw n+1 is 
guaranteed to be an interior point of V. 

Now we present a new acceleration operator that satisfies Acceleration Conditions 
(A) and (B). 

Linear Extension Operator 

For v G V, Z : v — > v + a*(u — v), where u = Tv and a* is the optimal solution to the 
following optimization problem: 

min + « /^( u i ~ v i) I T(v + a(u — v)) < v + a(u — v)j . (6) 

Figure [2] graphically illustrates how Linear Extension Operator works. It casts Tw n in the 
direction of Tw 11 — w n to the boundary of set V. For w n G V, we have Tw n < w n , implying 
that Tw n —w n is an improving direction within set V. As a result, Linear Extension Operator 
moves Tw n closer to the fixed point. An improvement is guaranteed whenever Tw n G int(V). 
When the transition matrix is fully dense for all d G II, such a improvement is also 
guaranteed. We can also define a variation of Linear Extension Operation as we did with 
Projective Operator to guarantee a strict inequalities in (i) and (iii) of the Remark 12.21 even 
when the matrices are not fully dense. 
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Figure [2] goes here. 



Theorem 2.2. Linear Extension Operator Z satisfies the conditions (A) and (B). 



When Linear Extension Operator is used in Step 1 of GAVI, we call the algorithm as 
Linear Extension Accelerated Value Iteration or LAVI for short in the paper. 

In what follows we will combine the two acceleration operators - Projective Operator 
and Linear Extension Operator - with variants of the standard value iteration algorithms: 
Jacobi, Gauss-Seidel, and Gauss-Seidel-Jacobi value iteration algorithms. 
Jacobi: v n+1 = Tjv n where 



77-1-1 

v i = max 

a£A(i) 



i) + X^2pij(a)vJ I 1 - Xpu(a) > 



Vz G S. 



Gauss-Seidel: v n+1 = Tcsv n where 



77-1-1 

v 4 = max 



aeA(i) 

Gauss-Seidel-Jacobi: v n+1 = T GSJ v n where 



I r(i, a) + \J2Pij( a )Vj +1 + A^/^(V)r; I 

L j<i j>i J 



Vz G S. 



(7) 



77-1-1 

v- = max 



1 - \pu(a) 



Vz G S. (9) 



Notice that the definition of the operators T GS and T GSJ depends on the order of state index. 



We will show that these operators can be used in the position of the standard operator T in 
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GAVI (and hence in PAVI and in LAVI). We start with the following definition of sets: 

Vj = {v e M |s| | v > Tjv}, 
V GS = {v e | v > T GS v}, 
V GS j = {ve | v > T GSJ v}. 

The following lemma is an analogue of Lemma 12.11 

Lemma 2.4. Vj, Vgs, an d Vgsj ore invariant under Tj,Tgs, andTcsj respectively. 

With Lemma l2~4l acceleration operators satisfying conditions (A) and (B) can be used in 
Step 1 of GAVI with the variants Tj,Tqs, and Tqsj of the standard operator T. However, 
it is not trivial to define V G s and Vgsj with a set of linear inequalities and the acceleration 
operators proposed in this paper will not work. To avoid the problem, we should restrict the 
acceleration operators to a strict subset of Vgs and Vgsj- In this paper we use V since it is 
a strict subset of Vgs and Vgsj as shown in Lemma 12.51 below. 

Lemma 2.5. The following relations hold 

V = Vj, V gs = V GSJ , and V C V GS . (10) 

Remark 2.5. Gauss-Seidel methods require special consideration, since in general V ^ Vgs- 
Indeed, let's consider an MDP with two states and one action per state with a transition 
probability matrix (J J), and a reward vector r = (j). Then v > Tv is equivalent to 
v i > 1 + Xv 2, v 2 > 1 + Xv\, while v > TqsV is equivalent to V\ > 1 + Xv 2, v 2 > 1 + A(l + Xvi), 
and it is easy to see that these systems are different. 

11 



Lemma 2.6. 

T GS (V GS )cV, and T GS j{V GS j) C V. (11) 

Theorem 2.3. The set V is invariant under Tj, T GS , and T GSJ . That is, 

TjVcV, T GS V C V, T GSJ VCV. 

Theorem 12.31 states that V is invariant under Tj, T GS , and T GSJ , which suggests that 
these operators can replace T in GAVI to give rise to new accelerated value iteration algo- 
rithms. Therefore, we obtain 8 accelerated versions of the value iteration algorithm, which 
are conveniently written in the form XAY, where 'X' stands for either "Projective" or "Lin- 
ear Extension", 'A' for "Accelerated", and 'Y' for one of VI, J, GS, or GSJ. For example, 
LAGS denotes Linear Extension 'Accelerated Gauss-Seidel method with w n+1 = ZT GS for 
Step 1. Non-accelerated versions will be shortened to VI, J, GS, and GSJ without prefixes. 

Remark 2.6. There are other modifications of VI based on the splittings of transition prob- 
ability matrix [7] . We believe that the strong inclusion demonstrated in Lemma 12.61 holds 
for the most practically useful splittings, but to avoid unnecessary complications we do not 
prove the statement in the most general setting. 

3 Numerical Studies 
3.1 Random examples 

In this section we present numerical studies to demonstrate the computational improvement 
that the two accelerating operators achieve over the standard value iteration algorithm as 
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well as its variants. We will consider two families of randomly generated MDP problems. 
In all cases the number of actions in each state was generated using a uniform random 
number generator between 2 and 99 and the immediate reward for each state and action pair 
was generated using a uniform random number generator in the range (1, 100). Transition 
probability matrices were generated differently between the two families. In Example 1, 
the transition probability matrices have non-zero elements distributed uniformly, while in 
Example 2, non-zero elements are only in a band around the diagonal. 

In Example 1, we first fixed the number of non-zero entries in each row so that the 
density of non-zero entries in that row should be equal to a given density level. We randomly 
generated non-zero entries according to a uniform distribution over (0,1), normalized these 
non-zero entries so that they add up to give the sum of 1, and then placed them randomly 
across the row. In Example 2, non-zero entries were generated and placed only around the 
diagonal so that the non-zero entries form a band around the diagonal in the matrix. The 
number of non-zero entries in each row was determined for a given density level. 

We also used exponential distributions instead of uniform distributions but the numerical 
results did not show any noticeable differences in performance. Therefore, we only present 
numerical studies with uniform distributions in this paper. 

In both Example 1 and 2, the density of the non-zero elements in the matrices varies 
between 20% and 100% and the discount factor is set to 0.9, 0.98, and 0.995. This results 
in 27 problem instances in each example, which are then solved using the standard value 
iteration algorithm and its three variants with and without our acceleration operators. 
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Example 1. Consider MDPs with 500 states and up to 100 actions per state. The transition 
probability matrices were generated using a uniform random number generator and non-zero 
elements were uniformly placed in the matrix. The density of non-zero elements of the 
matrices varies from 20% to 100%. The discount factor is set to 0.9, 0.98, and 0.995 and 
the tolerance used in the stopping condition is e = 1CT 3 . 

The computational results of Example 1 are presented in Table [TJ where the first and the 
second numbers in each cell are the number of iterations and the CPU time to converge to 
fixed points, respectively. PAVI improves the computational efficiency measured in the CPU 
time up to 605 times (PAVI combined with Jacobi value iteration when the density is 50% 
and the discount factor is 0.995). In general, PAVI has a larger improvement than LAVI, 
although LAVI is better in terms of the worst case comparison. Both operators improve the 
computation as the discount factor approaches 1, which is the case where the value iteration 
algorithm and its variants suffer from slow convergence. 



Table [T] goes here. 



Example 2. Consider MDPs with 500 states and up to 100 actions per state. The transition 
probability matrices are band matrix - matrix with non-zero elements only in a band centering 
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around the diagonal of the matrix - with a bandwidth varying from 100 to 450 (which gives 
us the density of the matrix varying from 20% to 90%). When the density is set to 100%, 
Example 2 becomes identical to Example 1. The transition probabilities inside the band were 
generated using a uniform random number generator. The discount factor is set to 0.9, 0.98, 
and 0.995, and the tolerance used in the stopping condition is e = 1CT 3 . 

The computational results are presented in Table EJ The general trend in Example 2 is 
similar to that in the earlier example except for the acceleration being less dramatic. The 
maximum acceleration in terms of the CPU time is observed when the discount factor is 
0.995 and the density is 80%, and PAVI accelerates the convergence 217 times over Jacobi 
value iteration. Similar to Example 1, the PAVI outperforms LAVI, both operators perform 
best when combined with the standard value iteration, and both accelerators perform better 
as the discount factor approaches 1 and the density approaches 100%. 

Let us now present the brief analysis of the above numerical results. Based on Example 
1 and Example 2, we can conclude that PAVI generally performs better than LAVI, and 
that the acceleration from PAVI is especially beneficial when it is combined with the stan- 
dard value iteration and Jacobi methods. LAVI exhibits a relatively minor acceleration with 
these two methods and shows stronger performance when it is combined with Gauss-Seidel 
and Gauss-Seidel- Jacobi algorithm. We propose the the following explanation to this be- 
havior. As one can notice, the impact of the projection operator is more significant when 
the Markovian operator moves the points deeper inside the set V rather then closer to fixed 
point. Because of this, combining the projection operator with the standard VI and Jacobi 
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VI is more advantageous than combining it with Gauss-Seidel and Gauss-Seidel-Jacobi VI. 
Moreover, it is assumed that than the higher the density of the transition probability matrix 
the deeper inside of the set V the point will be moved by VI in any of the above forms. This 
assumption is related to the result of Lemma [2T2l where fully dense matrices are the limiting 
case of the dense matrices. One can see, especially from results in Table El the denser the 
transition probability matrix, the better the performance of PAVI combined with standard 
and Jacobi VI. One more remark should be made. 

Remark 3.1. Notice that in some cases it turns out that PAVI+GSJ requires significantly 
more iterations than PAVI+GS. The reason is that the coordinate \S\ of u = Tqsjv is 
expressed in terms of Uj, j < \S\. Therefore for a certain a G -<4(|<S'|) we have the equality 



which easily implies that u G dV. Therefore, though it is not guaranteed, it is possible that 
PTqsjv = T GS jv for all v G V. 

On the other hand, one can notice that Gauss-Seidel and Gauss-Seidel-Jacobi VI usually 
converge faster than the standard VI and Jacobi VI, but they iterate points closer to facets of 
set V, which reduces the impact of using projection operators. As suggested by geometry of 
set V and supported by the numerical examples, combining Gauss-Seidel and Gauss-Seidel- 
Jacobi VI with a linear extension operator is more advantageous than combining standard 
VI and Jacobi VI with it. 

For almost all of the cases in Table 1, the combinations of PAVI + S and PAVI + J are 
the best combinations; however, for Table 2 one can see that for cases with sparse transition 




j<\s\ 
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probability matrices and A close to 1, the combinations LAVI + GS and LAVI + GSJ are 
more advantageous. The numerical studies show that combinations of PAVI + S and PAVI 
+ J perform better when the transition probability matrices are relatively dense ( 10% or 
more), while for sparse matrices the combinations LAVI + GS and LAVI + GSJ may perform 
better. For very sparse matrices (0.1% or less) the performance of all variants of GAVI is 
relatively poor. 

In general, the acceleration tends to be more dramatic as the density increases and 
as the discount factor approaches 1. This is encouraging, since the standard value iteration 
algorithm and its variants have been known to suffer from slow convergence when the discount 
factor is close to 1 and requires more computation as the transition probability matrix 
becomes denser. Therefore, our accelerators will perform excellently when they are needed 
the most. 

3.2 Computational Complexity and Savings 

We now present the computational cost of the proposed operators. For the standard VI, when 
the transition probability matrices are fully dense, each iteration will take C\S\ 2 (where 
C is the average number of actions per state) multiplications and divisions. With sparse 
transition probability matrices, this number can be estimated as iVC|S'| (where N is the 
average number of non-zero entries per row of the transition probability matrices). 

The additional effort required in GAVI is due to the acceleration operator used in Step 
1 of GAVI. Both accelerators require the computation of J^jPiji^Tw]" 1 for alii e / for 
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all a G A(i) in n-th iteration of GAVI. Notice that we will have to know Ylj Pij( a ) w j' 
in the next iteration in order to compute Tw n . For example, for PAVI w n = a*Tw n ~ 1 
and 'Y^jPij{o)w 1 j = a*^2jPij(a)Tw 1 J~ 1 . Therefore, we can store ^jPiji^Tw™ -1 between 
iterations to avoid redundant computation. A similar approach can be applied to the three 
variants of acceleration operators studied in this paper. Therefore, each iteration of GAVI 
requires CISTS' + 1| multiplication and division, which is just slightly more than that of 
standard value iteration. 

From the above we can draw the following conclusion: the additional computation due to 
the accelerating operators is marginal. Though it varies among different problem instances, 
the average additional computation per iteration observed in 54 instances from Example 1 
and Example 2, each of which is solved using 12 different approaches, is 12% with PAVI and 
15% with LAVI. 

4 The Expected Total Reward and Other MDPs 

In this section we will show that GAVI discussed in Section[2]can also be used with other types 
of MDP: MDPs with the expected total reward criterion, semi-Markov decision processes, 
and continuous-time MDP. 

Consider an infinite horizon MDP with expected total reward criterion. One can notice 
that both the coordinate translation and the rewards adjustment described in Remark 12.31 
cannot be applied to this model. This is because of A = 1 for the coordination translation and 
the non-zero reward at absorbing states for the reward adjustment. Therefore, the proposed 
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techniques are applicable only to positive models (and also to negative models with trivial 
changes in the definition of the acceleration operators). 

Recall that the value iteration algorithm for MDP with expected total reward criterion 
is based on successive application of an operator L to a vector v G M' 5 ' [7], 

(Lv)i = max < r(i,a) + pij(a)vj > . (12) 

aeA(i) ^ ^—^ J 

Since the operator L in Equation ( fl2|) is not a contraction mapping, the convergence of 
successive approximation algorithms using the operator L is not guaranteed. However, with 
v° satisfying the conditions < v < v * or v * < v°, the sequence {f T1 }^ =0 converges mono- 
tonically to v* [7|. 

The LP formulation equivalent to an MDP with expected total reward criterion is given 
as follows: 

min \vi- *y]pij(a)vj > r(i,a),Vi G S,Wa G A(i),v G 1R |5| | , (13) 

where v*, the optimal solution of the LP, yields the fixed point of the MDP. Analogous with 
the discounted MDP considered in Section [TJ the feasible set W of the LP in Equation (TlBl 
can be described as follows: 

W = {v G M) sl I v > Lv}. (14) 

The following Lemma states that the same approaches as introduced in Section [2] are appli- 
cable to MDPs with the total expected reward. 

Lemma 4.1. W is invariant under L. That is, LW C W. 
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With the invariance of W under L, not only GAVI(General Accelerated Value Iteration) 
algorithm but also all of its variants can be applied directly to MDPs with the expected total 
reward criterion. 

Remark 4.1. Jacobi and Gauss-Seidel-Jacobi variants of the value iteration algorithm are no 
longer valid for MDPs with the expected total reward criterion. 

The value iteration algorithm for MDPs with the expected total reward has been known 
for its very slow convergence. Our acceleration operators, which improve the efficiency by an 
increasingly margin as the discount factor approaches 1, are an attractive option for MDPs 
with the expected total reward, which can be seen as MDPs with the discounted reward with 
a discount factor 1. Example [3] below confirms this speculation. 

Example 3. Consider an MDP with 5 states and up to 5 actions per state. Let one of the 
states be an absorbing state with zero reward. The standard value iteration takes 293 itera- 
tions, while our acceleration value iteration with Projective Operator takes only 15 iterations 
before meeting the stopping condition (e = 10~ 3 ). 

Now we will make a short remark about semi-Markov and continuous-time decision pro- 
cess. It is widely known that discounted infinite-horizon semi-Markov decision processes and 
continuous-time MDPs can be transformed into models similar to a discrete time discounted 
model with rewards r(i,a) depending only on state at decision epoch and action [6j [7]. 
Therefore, our acceleration operations can be used for these types of MDPs after they are 
converted into discrete time MDPs with the total expected discounted reward. 
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5 Conclusions 

Inspired by the monotone behavior of the contraction mapping operator used in the value 
iteration algorithm within the feasible set of the linear programming problem equivalent to 
the given MDP, we propose a class of operators that can be used in combination with the 
standard contraction mapping and its variants, such as Jacobi, Gauss-Seidel, and Gauss- 
Seidel-Jacobi methods, to improve the computational efficiency. Two acceleration operators, 
Projective Operator and Linear Extension Operator, are particularly proposed and com- 
bined into the standard value iteration algorithm and its variants. The savings due to the 
acceleration have been dramatic and the maximum savings is up to 607 time faster than 
the case without our accelerating operator. The numerical studies show that combinations 
of PAVI + S and PAVI + J perform better when the transition probability matrices are 
relatively dense ( 10% or more), while for sparse matrices the combinations LAVI + GS and 
LAVI + GSJ may perform better. It is especially interesting to mention that the savings 
become more dramatic when encountering problems for which the standard value iteration 
algorithm suffers from slow convergence (when the discount factor approaches 1 and when 
the transition probability matrix becomes dense). The strong performance with the discount 
factor close to 1 can be of special interest to those who are investigating Blackwell Optimal- 
ity, which assures that a Blackwell optimal policy is optimal for all discount factors A& < 1 
close enough to 1 [H [10] . 
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A Proofs 

Lemma A.l. 

(i) If u > v, then Tu > Tv, Tju > Tjv, T G su > T G sv, and T G sju > T G sjv. 

(ii) Ifu>v, then Tu > Tv, Tju > Tjv, T GS u > T GS v, and T GSJ u > T GSJ v. 
Proof of Lemma \A.l\ Let us first prove that u > v implies Tu > Tv. 



Tui = max 

a&A{i) 



<|r(i, a) + A ^^pjj(a)-Uj j> > m ax <|r(i, a) + A ^^Pij(a)^ ^ = Tvi for all i £ S. 



For part (ii) the inequality can be simply replaced with a strict inequality. The same proof 
applies for Tju > Tjv. Let w = T G su and £ = T G sv. Then w\ = T G sU\ = Tu\ > Tv\ = 
£1 for all i £ S. By induction, assuming that Wk > for all k < i, we get for j = i 



max ( r(i, a) + A > Pij(a)wj + A > PiAa)Uj I 

> max ( r(i, a) + AVp i3 {a)( i + A VVy(a)^- J = {T GS v)i = 

This proof can be trivially modified for T GSJ u > T GSJ v,T GS u > T GS v, and T GSJ u > T GSJ v. 

□ 

Remark A.l. Notice that proof of Lemma [A. II is good even when A = 1. 

Proof of Lemma \2. 1\ Let v £ V and w = Tv. By definition of set V, u — Tv < v. By 
monotonicity shown in Lemma [A. 11 Tu < Tv = u. Thus, u £ V. □ 

Proof of Theorem \2.1[ Condition (A) is satisfied trivially since av £ V for any v £ V by the 
definition of Z given in dSJ). Now we have to show that Z satisfies condition (B). We know 
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a = 1 is feasible to the linear program (jSJ) since v G V (or Tv < v). Since J2i v i — due to 
rfi, a) > 0, Vi, a, we have a* < 1. Therefore, Zv = a*v < v. □ 



Proof of Lemma \2. 2\ Suppose that the transition probability matrix is fully dense (i.e., 
Pij(a) > 0, Wi,j,a). Then if v > u = Tv and v ^ v*, where v* is the fixed point of T, 
there exists k such that Vk > Uk- As a result, 



Tui = max < r(i, a) + \} pij(a)uj + \p ik (a)uk 
aeA(l) I m 

< max < r(i, a) + X > p il (a)f + An, 



'a)vk > — v i for all i E S. 



□ 



Proof of Lemma \2.3[ Let u G int(V) and u = Tf. By definition of mt(V), m = Tv < v. By 
monotonicity shown in Lemma [A. II Tu < Tv = u. Thus, u G int{V). □ 

Proof of Theorem \2.2\ For v G V, Z(v) = v + a*(Tt> — t>), where a* is an optimal solution 
to the linear program in [HI Since v + a*(Tv — v) is feasible to the linear program, we have 
T(v + a*(Tv — v)) < v + a*(Tv — u). This suffices Condition (A). By Tv = u G V, a = 1 is 
feasible. Since Tt> < v and a = 1 is feasible, a* > 0. Hence, Zt> < v and Condition (B) is 
satisfied. □ 



Proof of Lemma \2.J\ This proof is similar to the proofs of Lemma 12.11 and 12.31 □ 
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Proof of Lemma \2.5[ First let's show that Vj G V. From (171), v n+1 = Tjv n , where 



r)-t-1 

v { = max 



+ \^p i j(a)v r j 

3& 



> 



r(i,a) + Xy]pij(a)v? 



I [1 - Ap«(a)] 
/[l-Ap«(a)],Vae A(z) 



which then implies 



ra+1 „,n+l 



< +1 Apij(a) > r(i, a) + A^p y (o)i;? 



or 



< +1 >r(z,a)+A^^(a)^. 

i 

Therefore, Vj G V. Repeating the same steps and moving backward it is easy to show that 
V C Vj. Similarly we can show Vas — Vqsj- In order to prove inclusion V C Vgs, it is 
sufficient to show that if v > Tv, then v > Tqsv. For v G V, let u — Tqsv and w = Tv. 
Then, Wj < Vj for all j and U\ — W\. Assume that Uk < Wk for all k < i, then 



Mi = max 



(r(i, a) + Xj2pij(a)uj + Xj2Pij(a)vA 

j<i j>i 



< 



a£A{i) 



max ( r(i, a) + A } Pij(a)Wj + X > Pij(a)vj 



< max I r(i, a) + A > Pij(a)vj + X Pij(a)Vj J 



(Tv)i = Wi 



j<i j>i 

By induction, w > u, implying v > Tv = w > u = Tqsv. 



□ 



Proof of Lemma \2. b\ For v G Vgs, let u — Tqsv. By Lemma lA.ll v > Tqsv = u. By 



replacing "max agj 4(j)" by inequalities, similar to the argument used in the proof of Lemma l2.5[ 
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and by v > u, we have 

U{ > r(i, a) + A V^j?y(a)i«rj + A ^Jj>y(a)^ for all i E I for all a G 

> r(z, a) + A ^^pjj(a)uj + A yj^jj for alii G / for all a G 

which is equivalent to « > Tm, or u <E V. If f G Vqsj, then letting u = T GS j by similar 
argument we get u > Tju. Since Vj — V, u G V. □ 

Proof of Theorem \2.3[ By monotonicity shown in Lemma lA.14 TjVj C Vj. Therefore, by 
V = Vj in Lemma I2.5[ TjV C V. By Lemma 12.51 and Lemma 12.61 

T GS V C T G5 Vg 5 C V and T G5J V C T GSJ V GS j C V. 

□ 

Proof of Lemma \4-l\ The proof is similar to proofs of Lemma 12.11 and 12.31 □ 
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A 


0.9 


0.98 


0.995 


Dens. 1 ^ 


VI 


None* 


PAVI* 


LAVI* 


None* 


PAVI 1 


LAVI 1 


None* 


PAVI* 


LAVI* 


100 


S 


201 / 49 


7/2 


57 / 15 


1164 / 287 


8/2 


339 / 91 


5104 / 1257 


8/2 


1504 / 397 


J 


201 / 56 


7/2 


56 / 17 


1163 / 324 


8/3 


339 / 102 


5097 / 1427 


8/2 


1508 / 451 


GS 


75 / 18 


58 / 26 


33 / 15 


419 / 103 


317 / 139 


180 / 80 


1818 / 449 


1188 / 522 


774 / 343 


GSJ 


74 / 20 


59 / 26 


28 / 13 


417 / 117 


336 / 148 


110 / 50 


1811 / 503 


1485 / 653 


379 / 169 


90 


S 


201 / 47 


7 / 1 


58 / 15 


1164 / 273 


7/2 


335 / 84 


5104 / 1196 


8/2 


1498 / 377 


J 


201 / 53 


6/2 


57 / 17 


1163 / 308 


7/2 


336 / 95 


5098 / 1355 


8/2 


1493 / 418 


GS 


75 / 17 


57 / 24 


26 / 11 


418 / 98 


306 / 127 


163 / 68 


1817 / 427 


1081 / 447 


743 / 310 


GSJ 


74 / 20 


58 / 24 


20/9 


417 / 111 


333 / 139 


97 / 42 


1810 / 478 


1474 / 614 


310 / 131 


80 


S 


201 / 43 


7/2 


58 / 13 


1164 / 248 


8 / 1 


342 / 78 


5102 / 1081 


9/2 


1521 / 345 


J 


201 / 48 


7/2 


57 / 15 


1162 / 277 


8/2 


340 / 89 


5095 / 1216 


9/2 


1504 / 393 


GS 


75 / 16 


56 / 21 


32 / 13 


418 / 89 


275 / 101 


180 / 68 


1815 / 385 


788 / 290 


768 / 285 


GSJ 


74 / 17 


58 / 22 


26 / 10 


416 / 98 


334 / 124 


121 / 46 


1808 / 429 


1477 / 546 


399 / 149 


70 


S 


201 / 37 


7 / 1 


59 / 12 


1163 / 223 


8 / 1 


348 / 70 


5100 / 973 


9/2 


1530 / 310 


J 


201 / 42 


7/1 


58 / 14 


1162 / 246 


8 / 1 


345 / 77 


5093 / 1080 


9/2 


1538 / 354 


GS 


75 / 14 


59 / 19 


31 / 11 


418 / 81 


330 / 108 


174 / 57 


1815 / 342 


1365 / 447 


769 / 253 


GSJ 


74 / 16 


59 / 19 


31 / 11 


416 / 88 


337 / 111 


142 / 47 


1807 / 382 


1488 / 489 


443 / 147 


60 


S 


201 / 33 


8 / 1 


62 / 11 


1163 / 190 


9 / 1 


356 / 62 


5098 / 835 


10 / 1 


1558 / 272 


J 


201 / 37 


8 / 1 


61 / 12 


1161 / 211 


9/2 


353 / 70 


5092 / 930 


10/2 


1549 / 304 


GS 


74 / 12 


60 / 17 


33/9 


417 / 71 


342 / 95 


108 / 31 


1813 / 297 


1510 / 419 


418 / 117 


GSJ 


74 / 13 


60 / 17 


33/9 


416 / 75 


341 / 96 


106 / 30 


1806 / 327 


1505 / 421 


441 / 125 


50 


S 


201 / 31 


8 / 1 


62 / 10 


1165 / 180 


9/2 


353 / 58 


5106 / 789 


9/2 


1546 / 257 


J 


201 / 34 


8 / 1 


61 / 11 


1163 / 199 


9/2 


353 / 66 


5098 / 873 


9/1 


1543 / 292 


GS 


75 / 12 


57 / 15 


32/9 


419 / 65 


285 / 74 


180 / 48 


1820 / 281 


855 / 222 


788 / 208 


GSJ 


74 / 13 


59 / 15 


32/9 


417 / 72 


336 / 87 


108 / 29 


1812 / 312 


1486 / 387 


322 / 86 


40 


S 


201 / 28 


8 / 1 


62/9 


1164 / 163 


9 / 1 


355 / 52 


5105 / 715 


10 / 1 


1571 / 235 


J 


201 / 31 


8 / 1 


61/9 


1163 / 178 


9/2 


353 / 58 


5099 / 784 


10/2 


1565 / 262 


GS 


75 / 10 


59 / 13 


32/7 


418 / 61 


336 / 77 


101 / 24 


1816 / 255 


1486 / 340 


111 / 26 


GSJ 


74 / 12 


58 / 13 


30/7 


417 / 64 


335 / 78 


105 / 24 


1810 / 277 


1481 / 342 


139 / 33 


30 


S 


201 / 25 


8 / 1 


64/8 


1164 / 145 


9/2 


358 / 48 


5102 / 635 


10 / 1 


1584 / 210 


J 


201 / 27 


8/1 


63/8 


1162 / 157 


9/1 


356 / 55 


5096 / 689 


10 / 1 


1580 / 221 


GS 


74/9 


58 / 12 


32/7 


418 / 52 


334 / 66 


82 / 16 


1813 / 232 


1479 / 291 


113 / 23 


GSJ 


74 / 10 


58 / 12 


32/7 


416 / 56 


333 / 66 


71 / 15 


1807 / 245 


1473 / 292 


134 / 27 


20 


S 


201 / 21 


8/1 


64/7 


1163 / 122 


9/1 


361 / 40 


5101 / 534 


10 / 1 


1605 / 178 


J 


201 / 22 


8/1 


63/8 


1162 / 130 


9/1 


360 / 47 


5093 / 574 


10 / 1 


1600 / 199 


GS 


74/8 


58/9 


31/5 


417 / 44 


332 / 53 


86 / 14 


1813 / 190 


1471 / 233 


114 / 19 


GSJ 


74/9 


58 / 10 


23/4 


415 / 46 


331 / 54 


70 / 11 


1804 / 201 


1463 / 236 


190 / 31 



f: The density of the transition probability matrix (%). 

p. Two values in these columns correspond to the number of iterations and CPU time of the corresponding 
algorithm. 



Table 1: The number of iterations and CPU time of the standard, Jacobi, Gauss-Seidel, and 
Gauss-Seidel-Jacobi value iteration algorithrja§ with and without an accelerating operator 
applied to a family of MDPs from Example [T] 



A 


0.9 


0.98 


0.995 


Density^ 


VI 


None! 


PAVI* 


LAVI* 


None* 


PAVI* 


LAVI* 


None! 


PAVI* 


LAVI* 


90 


S 


201 / 36 


15/3 


65 / 12 


1164 / 209 


20/4 


385 / 75 


5102 / 919 


23/4 


1711 / 336 


J 


201 / 42 


15/3 


64 / 13 


1162 / 244 


20/4 


384 / 79 


5092 / 1081 


23/5 


1710 / 343 


GS 


75 / 14 


59 / 16 


29/8 


419 / 76 


334 / 89 


171 / 46 


1818 / 327 


1465 / 387 


752 / 201 


GSJ 


75 / 16 


58 / 16 


21/6 


417 / 87 


333 / 89 


123 / 34 


1815 / 382 


1473 / 391 


321 / 87 


80 


S 


201 / 33 


15/2 


64 / 10 


1164 / 179 


22/3 


378 / 63 


5103 / 785 


26/4 


1678 / 282 


J 


201 / 36 


15/2 


63 / 11 


1162 / 209 


22/4 


377 / 69 


5093 / 912 


26/4 


1679 / 298 


GS 


75 / 12 


56 / 13 


33/8 


419 / 65 


252 / 58 


173 / 40 


1820 / 280 


621 / 143 


777 / 181 


GSJ 


75 / 13 


59 / 14 


29/7 


417 / 74 


335 / 78 


108 / 27 


1816 / 328 


1478 / 343 


297 / 70 


70 


S 


202 / 26 


17/3 


66/9 


1164 / 152 


26/4 


392 / 56 


5101 / 704 


32/5 


1746 / 247 


J 


201 / 30 


17/3 


65/9 


1161 / 175 


26/4 


391 / 58 


5090 / 766 


32/4 


1745 / 251 


GS 


75/9 


58 / 11 


32/7 


420 / 55 


278 / 56 


175 / 36 


1820 / 235 


786 / 156 


766 / 154 


GSJ 


75 / 11 


59 / 12 


32/7 


417 / 63 


337 / 68 


145 / 30 


1812 / 274 


1487 / 300 


367 / 78 


60 


S 


201 / 23 


29/3 


68/8 


1163 / 124 


47/6 


398 / 47 


5096 / 544 


57/6 


1766 / 207 


J 


201 / 25 


29/3 


67/8 


1160 / 144 


47/5 


399 / 48 


5083 / 632 


57/7 


1767 / 209 


GS 


75/8 


56/9 


32/6 


419 / 44 


226 / 37 


182 / 30 


1817 / 193 


477 / 78 


785 / 131 


GSJ 


75 / 10 


60 / 10 


34/5 


416 / 52 


341 / 57 


127 / 21 


1809 / 225 


1503 / 248 


361 / 61 


50 


S 


202 / 19 


45/4 


68/7 


1163 / 110 


80/8 


396 / 42 


5096 / 481 


99 / 10 


1756 / 185 


J 


201 / 23 


45/5 


67/9 


1160 / 127 


80/9 


396 / 49 


5082 / 556 


99 / 10 


1759 / 210 


GS 


76/7 


57/9 


31/5 


419 / 40 


251 / 39 


162 / 25 


1818 / 171 


605 / 90 


725 / 111 


GSJ 


75/8 


59/9 


32/6 


416 / 46 


335 / 50 


124 / 20 


1806 / 207 


1479 / 220 


239 / 37 


40 


S 


202 / 16 


51/4 


69/7 


1163 / 93 


108 / 9 


400 / 35 


5095 / 429 


143 / 12 


1765 / 155 


J 


201 / 19 


50/5 


68/9 


1159 / 106 


108 / 10 


399 / 40 


5077 / 483 


142 / 12 


1767 / 173 


GS 


76/6 


56/7 


32/4 


420 / 33 


236 / 29 


179 / 23 


1819 / 143 


526 / 65 


742 / 94 


GSJ 


75/7 


60/8 


32/4 


416 / 40 


334 / 41 


141 / 19 


1806 / 168 


1472 / 184 


235 / 30 


30 


S 


203 / 13 


62/4 


69/5 


1163 / 73 


170 / 12 


399 / 28 


5094 / 318 


252 / 17 


1763 / 123 


J 


202 / 14 


62/4 


68/6 


1158 / 83 


169 / 13 


398 / 32 


5071 / 365 


251 / 18 


1765 / 136 


GS 


77/5 


57/6 


35/3 


421 / 26 


235 / 24 


174 / 18 


1823 / 114 


514 / 51 


769 / 79 


GSJ 


76/5 


60/6 


35/4 


416 / 31 


333 / 34 


159 / 16 


1801 / 129 


1466 / 148 


408 / 21 


20 


S 


204 / 9 


76/3 


68/3 


1163 / 52 


283 / 13 


400 / 21 


5085 / 226 


525 / 25 


1772 / 89 


J 


202 / 10 


75/4 


68/3 


1155 / 59 


280 / 15 


399 / 23 


5052 / 258 


520 / 27 


1767 / 101 


GS 


78/4 


57/4 


36/3 


422 / 18 


216 / 16 


182 / 13 


1823 / 81 


420 / 31 


768 / 59 


GSJ 


77/4 


60/4 


34/2 


414 / 22 


330 / 25 


149 / 12 


1789 / 93 


1448 / 107 


298 / 23 


10 


S 


204 / 5 


82/3 


69/2 


1162 / 32 


361 / 11 


400 / 13 


5065 / 138 


1091 / 31 


1772 / 56 


J 


201 / 6 


80/3 


68/2 


1147 / 34 


355 / 13 


398 / 15 


4997 / 151 


1071 / 37 


1767 / 63 


GS 


80/3 


57/2 


37/2 


430 / 11 


248 / 12 


189 / 10 


1838 / 49 


674 / 32 


796 / 40 


GSJ 


78/3 


61/3 


37/2 


414 / 13 


332 / 16 


164 / 8 


1770 / 54 


1438 / 69 


429 / 22 



f: The density of the transition probability matrix (%). 

p. Two values in these columns correspond to the number of iterations and CPU time of the corresponding 
algorithm. 



Table 2: The number of iterations and CPU time of the standard, Jacobi, Gauss-Seidel, and 
Gauss-Seidel-Jacobi value iteration algorithrjig with and without an accelerating operator 
applied to a family of MDPs from Example [2] 



